1. INTRODUCTION

The problem of improper text borrowing is relevant to education and scientific research [1].
According to a 2019 study by Al-Janabi et al. [2], more than 1,500 dissertations on historical sciences
defended in Iraq after 2000 contain significant borrowings from other dissertations. For detecting
borrowings within one language, industrial tools that represent documents as sets of overlapping word
n-grams (shingles) [3] show high recall [1]. This approach searches effectively for exact text borrowings
but fails to detect borrowings with a large proportion of paraphrased text or with inserts of text translated
from another language. Several approaches address the problem of finding translated borrowings for
particular pairs of languages [4], [5], for example the Spanish-English pair. This work is devoted to the
detection of translated borrowings for the Arabic-English pair. This pair is rarely covered in the
literature, and the two languages are not related. The choice of the Arabic-English pair is due to the
predominance of English-language publications on the Internet and a better knowledge of this language
compared to others. Similar to [6], [7], this article describes an algorithm for the full cycle of borrowing
search: first, candidate documents are retrieved from an external collection; then they are compared in
detail with the document being checked. The proposed algorithm is based on monolingual analysis of
documents, similar to [8], [9]: the document being checked is translated into English by a machine
translation system, after which text fragments of the documents are compared. Several works devoted to
the search for translated borrowings use additional resources, such as thesauri and ontologies, and suggest
using knowledge bases to extract information about the proximity between texts [4], [5]. Kaliappan et al.
[5] propose an algorithm based on a combination of neural networks and knowledge graphs. The main
disadvantage of this approach is its resource intensity: multilingual ontologies and knowledge bases require
large computational capacity both to build a semantic graph for each text fragment and to compare the
resulting graphs. In this work, we propose a decomposition of the algorithm for detecting translated
borrowings that makes it suitable for searching through large text collections [10].


2. SEARCH FOR CANDIDATE DOCUMENTS 

For the document being checked, the most relevant candidate documents are retrieved using a
modification of the shingle algorithm. The texts are then divided into fragments, and each fragment is
mapped into a vector space. For each vector of the document being checked, the nearest vectors from the
candidate documents are found, after which the resulting pairs of vectors are classified as similar or
dissimilar pairs of text fragments. Since the proposed algorithm relies on monolingual analysis of
borrowings, the problem is close to the problem of paraphrase detection. Several approaches [9], [11]-[14]
to this problem use vector representations of phrases obtained with deep neural networks. The paper [13]
proposes the neural bag-of-words model and deep averaging networks.


3. DOCUMENT COMPARISON 

In this article, the outputs of a neural network are used as vector representations of text
fragments, which are then passed to an approximate nearest neighbor search algorithm [15]. The paper
investigates the properties of the proposed method for detecting translated borrowings. The deep learning
models used at the document comparison stage and the composite objective function are analyzed. The
quality of the proposed method is verified both on a synthetic sample and on articles from journals, and an
error analysis is performed. The proposed method is compared with a baseline algorithm for finding
borrowings that combines machine translation with the shingle algorithm.

Problem statement. Let collections of documents in English, $D_e = \{d_e^j\}_{j=1}^{|D_e|}$, and in
Arabic, $D_a = \{d_a^i\}_{i=1}^{|D_a|}$, be given. Documents in Arabic and English are represented as
concatenations of text fragments:

$$d_a^i = s_{a,1}^i \cup \dots \cup s_{a,n_i}^i, \qquad d_e^j = s_{e,1}^j \cup \dots \cup s_{e,m_j}^j.$$

Let the sample be given:

$$D = \{(d_e^i, d_a^i), AL^i\}_{i=1}^{M},$$

where each pair of documents $(d_e^i, d_a^i)$, $d_e^i \in D_e$, $d_a^i \in D_a$, is associated with a list
of fragment pairs:

$$AL^i = [(s_{e,k_1}^i, s_{a,l_1}^i), \dots, (s_{e,k_r}^i, s_{a,l_r}^i)];$$

for each pair $(s_{e,k}^i, s_{a,l}^i)$, it is known that the fragment $s_{a,l}^i$ is a translation of the
fragment $s_{e,k}^i$.
The model $f$ is defined as the sequential execution of the filter and comparison functions, where filter:

$$(d_a^i, D_e)_{d_a^i \in D_a} \to D_e^{\mathrm{retrieved}_i} \subset D_e,$$

comparison:

$$(d_a^i, D_e^{\mathrm{retrieved}_i})_{d_a^i \in D_a} \to AL'^i.$$

Here $AL'^i$ is the list of fragment pairs found by the model. The filter function narrows the set of
collection documents that are compared with the document being checked, which makes it feasible to run a
more detailed comparison with resource-intensive algorithms based on deep learning models. The quality
of the model $f$ is evaluated using precision and recall:
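$$\mathrm{Precision} = \frac{\left| \left( \bigcup_{i=1}^{M} AL^i \right) \cap \left( \bigcup_{i=1}^{M} AL'^i \right) \right|}{\left| \bigcup_{i=1}^{M} AL'^i \right|}, \qquad
\mathrm{Recall} = \frac{\left| \left( \bigcup_{i=1}^{M} AL^i \right) \cap \left( \bigcup_{i=1}^{M} AL'^i \right) \right|}{\left| \bigcup_{i=1}^{M} AL^i \right|}.$$

It is required to find the function $f$ that maximizes F1, the harmonic mean of precision and recall:

$$f^* = \arg\max_{f \in \mathcal{F}} F1(f, D), \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $\mathcal{F}$ is a given family of models.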
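The quality criteria can be illustrated with a minimal Python sketch. It assumes that every fragment pair is identified by a pair of string identifiers and that the lists are indexed by document pair; the names `pooled` and `precision_recall_f1` are hypothetical and are not part of the described system.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a fragment pair is (English fragment id, Arabic fragment id).
Pair = Tuple[str, str]


def pooled(lists: List[Iterable[Pair]]) -> Set[Tuple[int, str, str]]:
    """Union of the per-document lists; pairs are tagged with the document index
    so that equal fragment ids from different document pairs do not collide."""
    return {(i, e, a) for i, pairs in enumerate(lists) for (e, a) in pairs}


def precision_recall_f1(gold: List[Iterable[Pair]],
                        predicted: List[Iterable[Pair]]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over the pooled gold and predicted fragment pairs."""
    g, p = pooled(gold), pooled(predicted)
    hits = len(g & p)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


gold = [[("e1", "a1"), ("e2", "a2")], [("e5", "a3")]]
predicted = [[("e1", "a1"), ("e9", "a2")], [("e5", "a3")]]
print(precision_recall_f1(gold, predicted))
```

Pooling all pairs before counting follows the union-based definition of the two measures; averaging per-document scores would be a different (macro-averaged) criterion.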


4. SEARCH FOR CANDIDATE DOCUMENTS 

One of the algorithms for retrieving candidate documents in verbatim borrowing detection and
near-duplicate search is based on an inverted index in which each document of the collection is
represented by a set of shingles [3], that is, a set of overlapping n-grams. The document being checked is
also divided into shingles, after which the inverted index is used to retrieve the documents that share the
largest number of shingles with it. In this paper, we propose a generalization of the shingle algorithm that
improves the quality of candidate retrieval when translated borrowings are to be detected. The filter
function is proposed as follows:


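$$\mathrm{filter}(d_a^i, D_e) = \underset{d_e \in D_e}{\arg\max{}^{K}} \sum_{h \in H(d_a^i)} \frac{[h \in H(d_e)]}{\left| \{ d \in D_e : h \in H(d) \} \right|^{\alpha} + \mathrm{const}},$$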


Here $H(d)$ is the set of N-grams of the document $d$, each N-gram being an ordered sequence of N cluster
labels (the procedure for forming clusters is described below); $\alpha \in \mathbb{R}$; $K$ is a
hyperparameter to be tuned. To reduce the impact of translation ambiguity on the search for candidate
documents, words are replaced with the corresponding cluster labels:
$\{x_1, \dots, x_N\} \to \{class(x_1), \dots, class(x_N)\} = h$, where $x_1, \dots, x_N$ are words. Clusters
are built in advance from a text corpus and contain semantically similar words. To further reduce
translation ambiguity, stop words are removed from the text and the remaining words are lemmatized before
the text is split into N-grams. To account for word permutations that arise when the text is translated,
the words inside each N-gram are sorted in lexicographic order. In this paper, clusters are obtained from
a word vector representation model based on the distributional hypothesis. Clustering is performed using
the cosine distance function:


$$\cos(c_1, c_2) = \frac{\langle c_1, c_2 \rangle}{\|c_1\| \, \|c_2\|}, \qquad (1)$$

where $c_1$ and $c_2$ are vectors from the same vector space. Below are examples of the resulting clusters:
i) [beers, beer, brewing, brew, brewery, brewed, pint, Guinness, stout, ipa, lager, ale, keg, pints]; and
ii) [excellent, brilliant, best, exceptional, super, outstanding, amazing].

To compare the retrieved candidate documents $D_e^{\mathrm{retrieved}_i}$ with the document being checked
$d_a^i$, a vector representation model of phrases is used: the texts are divided into fragments and the
corresponding vectors are compared. Below are the details of the comparison algorithm, as well as an
analysis of the proposed optimization problem.
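The candidate search described in this section can be sketched in Python as follows. The sketch is only an illustration under stated assumptions: the word-to-cluster mapping is given as a dictionary, out-of-vocabulary words are dropped, the default N-gram length is arbitrary, and each shared N-gram is weighted by an assumed inverse-frequency term with hypothetical parameters `alpha` and `const`. All function names are hypothetical.

```python
from collections import defaultdict


def cluster_shingles(tokens, word_to_cluster, n=2, stop_words=frozenset()):
    """Replace lemmas by cluster labels and return the set of sorted label n-grams.
    Assumption: tokens are already lemmatized; unknown words are skipped."""
    labels = [word_to_cluster[t] for t in tokens
              if t not in stop_words and t in word_to_cluster]
    return {tuple(sorted(labels[i:i + n])) for i in range(len(labels) - n + 1)}


def build_index(collection):
    """Inverted index: n-gram -> set of ids of documents containing it.
    `collection` maps a document id to its set of n-grams."""
    index = defaultdict(set)
    for doc_id, shingles in collection.items():
        for h in shingles:
            index[h].add(doc_id)
    return index


def filter_candidates(query_shingles, index, k=10, alpha=1.0, const=1.0):
    """Return the ids of the k documents with the highest weighted n-gram overlap."""
    scores = defaultdict(float)
    for h in query_shingles:
        postings = index.get(h, ())
        if not postings:
            continue
        # Rare n-grams contribute more than frequent ones (assumed weighting).
        weight = 1.0 / (len(postings) ** alpha + const)
        for doc_id in postings:
            scores[doc_id] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


clusters = {"beer": 0, "ale": 0, "lager": 0, "excellent": 1, "brilliant": 1,
            "pub": 2, "serve": 3}
collection = {
    "A": cluster_shingles("pub serve excellent ale".split(), clusters),
    "B": cluster_shingles("brilliant lager pub".split(), clusters),
}
index = build_index(collection)
query = cluster_shingles("serve brilliant beer".split(), clusters)
print(filter_candidates(query, index, k=1))
```

In the toy example the query shares no surface words with document A except "serve", yet A is ranked first because "brilliant beer" and "excellent ale" map to the same cluster labels.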


4.1. The model of the vector representation of the phrase 

Let us take a closer look at how a fragment is mapped into a vector. Let each word of a document in
the language of the collection be associated with a vector $v \in \mathbb{R}^u$ of dimension $u$. For
simplicity, we assume that all fragments in the collection language have length bounded by $n_{col}$. Then
the fragment vectorization model is the mapping
$h: W \times \mathbb{R}^{u \times n_{col}} \to \mathbb{R}^u$, where $W$ is the parameter space of the
model. Objects of the set $\mathbb{R}^{u \times n_{col}}$ are sequential concatenations of the word vectors
of the sample fragments:

$$x = [v_1, \dots, v_{n_{col}}] \in \mathbb{R}^{u \times n_{col}}.$$


To handle fragments shorter than $n_{col}$, we define a special vector denoting an empty word. The
model is optimized in a semi-supervised mode. The objective is a composite error function, the weighted
sum of the reconstruction error and the margin (indentation) error:

$$\alpha E_{rec}(X_{rec}, w) + (1 - \alpha) E_{me}(X_{me}, w) \to \min_{w}, \qquad (2)$$

where $E_{rec}$ is the reconstruction error; $E_{me}$ is the margin error; $X_{rec}$ and $X_{me}$ are
training samples; $w$ are the model parameters; $\alpha$ is a tunable hyperparameter. Let us consider each
term of the error function in more detail. The first term corresponds to the autoencoder model. Let a
sample $X_{rec} \subset \mathbb{R}^{u \times n_{col}}$ be given. The model $h$ acts as an encoder of the
information in the sample $X_{rec}$. Let an auxiliary decoding function $g$ also be given, which restores
the original vector representation $x$ from the outputs of the model:

$$r(x, w) = g(\cdot, w) \circ h(x, w) \approx x, \quad x \in \mathbb{R}^{u \times n_{col}}.$$

The minimized reconstruction error has the form:

$$E_{rec}(X_{rec}, w) = \frac{1}{|X_{rec}|} \sum_{x \in X_{rec}} \|x - r(x, w)\|_2^2. \qquad (3)$$


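The choice of the reconstruction error as the objective can be justified by the results of [16], where it
was shown that an autoencoder with a special type of regularization makes it possible to estimate the
distribution $p(x)$ of objects from the general population.
Theorem 1 [16]. Let $p$ be a differentiable probability density with $p(x) \neq 0$ for all
$x \in \mathbb{R}^{u \times n_{col}}$, and let $L_{\sigma^2}$ be a loss function of the form:

$$L_{\sigma^2} = \int_{\mathbb{R}^{u \times n_{col}}} p(x) \left( \|x - r(x, w)\|_2^2 + \sigma^2 \left\| \frac{\partial r(x, w)}{\partial x} \right\|_F^2 \right) dx.$$

Then the reconstruction function $r_{\sigma^2}$ that minimizes $L_{\sigma^2}$ satisfies
$r_{\sigma^2}(x, w) = x + \sigma^2 \frac{\partial \log p(x)}{\partial x} + o(\sigma^2)$ as $\sigma \to 0$.

Using the result of Theorem 1, we can make the following statement.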
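Theorem 2. Let the probability density be represented as $p(x) = \frac{1}{Z} \exp(-E(x))$, where $Z$ is
the normalization constant. Then, as $\sigma \to 0$,

$$\frac{r_{\sigma^2}(x, w) - x}{\sigma^2} \to -\frac{\partial E(x)}{\partial x}.$$

Proof. By the result of [16],

$$r_{\sigma^2}(x, w) = x + \sigma^2 \frac{\partial \log p(x)}{\partial x} + o(\sigma^2);$$

$$\frac{r_{\sigma^2}(x, w) - x}{\sigma^2} = \frac{\partial \log p(x)}{\partial x} + o(1);$$

$$\frac{r_{\sigma^2}(x, w) - x}{\sigma^2} \xrightarrow[\sigma \to 0]{} \frac{\partial \log p(x)}{\partial x}.$$

Representing $\log p(x)$ in the form $-E(x) - \log Z$, we get the desired expression. Thus, when the
regularization parameter $\sigma$ tends to zero, a language model is obtained, i.e., a probability
distribution over text sequences.

The second term of the composite error function is the margin error [17]. To optimize this error
function, a sample $X_{me} = \{(x_i, x_j)\}$ consisting of pairs of objects is used: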


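$$X_{me} \subset X_{me}^A \times X_{me}^B \subset \mathbb{R}^{u \times n_{col}} \times \mathbb{R}^{u \times n_{col}},$$

$$E_{me}(X_{me}, w) = \frac{1}{|X_{me}|} \sum_{(x_i, x_j) \in X_{me}} \left( \max(0, \delta - c_i) + \max(0, \delta - c_j) \right), \qquad (4)$$

$$c_i = \cos(h(x_i, w), h(x_j, w)) - \cos(h(x_i, w), h(x_i', w));$$
$$c_j = \cos(h(x_i, w), h(x_j, w)) - \cos(h(x_j, w), h(x_j', w));$$

where $\delta$ is the margin; $\cos$ is the cosine distance function (1);

$$x_i' = \underset{x' \in X_{me}^B,\; x' \neq x_j}{\arg\max} \cos(h(x_i, w), h(x', w)),$$

$$x_j' = \underset{x' \in X_{me}^A,\; x' \neq x_i}{\arg\max} \cos(h(x_j, w), h(x', w)).$$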
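The margin term can be illustrated with a small Python sketch that operates on already computed fragment embeddings. It is a sketch under assumptions rather than the training code of the model: embeddings are plain lists of floats with non-zero norm, the hardest negatives are taken from the same batch of pairs, the result is averaged over the pairs, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine measure (1) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def margin_error(pairs, delta=0.3):
    """Margin error over a batch of embedding pairs (h_i, h_j) of similar fragments.
    For every pair, the closest embedding from the opposite side that belongs
    to another pair serves as the negative example."""
    if len(pairs) < 2:
        return 0.0  # no negative examples are available
    total = 0.0
    for idx, (h_i, h_j) in enumerate(pairs):
        positive = cosine(h_i, h_j)
        neg_i = max(cosine(h_i, other[1]) for k, other in enumerate(pairs) if k != idx)
        neg_j = max(cosine(h_j, other[0]) for k, other in enumerate(pairs) if k != idx)
        total += max(0.0, delta - (positive - neg_i))
        total += max(0.0, delta - (positive - neg_j))
    return total / len(pairs)


separated = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
collapsed = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 0.0])]
print(margin_error(separated), margin_error(collapsed))
```

The error vanishes when paired fragments coincide and fragments from different pairs are far apart, and it grows to $2\delta$ per pair when all embeddings collapse into one point.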
The following theorem explains the behavior of this term during the optimization of the parameters $w$
of the model $h$.
Theorem 3. Let the following conditions be satisfied:
i). The hyperparameter $\delta \in (0, 2)$ is fixed.
ii). The size of the sample $|X_{me}|$ is bounded as follows:


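$$|X_{me}|(|X_{me}| - 1) < \frac{\sqrt{\pi}\, \Gamma\left(\frac{u-1}{2}\right)}{\Gamma\left(\frac{u}{2}\right) \int_0^{\cos^{-1}(1-\delta)} \sin^{u-2} x \, dx}. \qquad (5)$$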


iii). The subsamples $X_{me}^A$ and $X_{me}^B$ contain each element exactly once, and no element occurs in
both subsamples.
Then there exists a continuous mapping $h$ from the set of vector representations
$\mathbb{R}^{u \times n_{col}}$ to the vector space $\mathbb{R}^u$ that delivers the global minimum
$E_{me} = 0$.
Proof. We construct the mapping $h$ explicitly. Let $h(x_i) = h(x_j)$ for each pair
$(x_i, x_j) \in X_{me}$; then, up to a multiplier, the error function takes the form:

$$E_{me} \propto \sum_{(x_i, x_j) \in X_{me}} \max(0, \delta - 1 + \cos(h(x_i), h(x_i'))) + \max(0, \delta - 1 + \cos(h(x_j), h(x_j'))).$$

The function is bounded from below by zero, and this value is attained when the following condition is
met: $1 - \delta \ge \cos(h(x), h(x'))$ for any pair $x \in X_{me}^A$, $x' \in X_{me}^B$ such that
$(x, x') \notin X_{me}$. When the third condition of the theorem is fulfilled, the number of such pairs is
$|X_{me}|(|X_{me}| - 1)$. We assign the values of the mapping $h$ so that
$\cos(h(x), h(x')) \le 1 - \delta$ for each such pair. The existence of such a mapping follows from the
problem of finding a spherical code of maximum size on a sphere in a space of dimension $u$ with minimum
angle $\cos^{-1}(1 - \delta)$: a lower estimate for the size of a code satisfying these conditions is
known, and it corresponds to the right-hand side of inequality (5). Since the sample $X_{me}$ is finite,
interpolation polynomials can be used to construct a continuous function satisfying the conditions
described above, which completes the proof.

Note that the mapping constructed in the theorem is continuous, so neural network models can be used
to approximate it. According to Cybenko's theorem, mappings from the class of neural network models
approximate continuous mappings arbitrarily well [18]. Thus, the composite objective (2) yields a model
that, on the one hand, has the generalizing properties provided by the language model term (3) and, on the
other hand, effectively separates similar and dissimilar phrases from the training sample (4). The
hyperparameter $\alpha$ controls the contribution of each term to this function.


4.2. The classifier 

For each phrase vector $h(x_a)$ from the document being checked $d_a^i$, the $v$ nearest vectors by
the cosine distance function (1) are found among the fragments of the candidate documents
$D_e^{\mathrm{retrieved}_i}$ using approximate nearest neighbor search. The main purpose of this procedure
is to reduce the number of fragment pairs passed to classification and thereby the resource intensity of
the document comparison stage. For the vector representations $(h(x_a), h(x_e))$ of a pair of fragments,
the following decision rule is considered:

$$\mathrm{Fragments}(h(x_a), h(x_e)) = \begin{cases} 1, & \cos(h(x_a), h(x_e)) > t_1 \text{ and } p(h(x_a), h(x_e)) > t_2, \\ 0, & \text{otherwise}, \end{cases} \qquad (6)$$

where $p$ is the probability returned by the classifier; $t_1$ is the threshold on the cosine distance
function (1); $t_2$ is the minimum threshold on the classifier probability. The features are the
concatenation of the element-wise absolute difference and the element-wise product of the two vectors:
$[\,|h(x_a) - h(x_e)|,\; h(x_a) \odot h(x_e)\,]$. A random forest model acts as the classifier.
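The decision rule can be sketched in Python as follows. The sketch assumes that both thresholds must be exceeded and that the trained classifier is available as a callable returning the probability of the positive class; `classifier_prob` is a hypothetical stand-in for the random forest, and the function names are illustrative only. The default thresholds are the values reported in the experiments.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine measure (1) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pair_features(h_a: Sequence[float], h_e: Sequence[float]) -> List[float]:
    """Concatenation of the element-wise absolute difference and product."""
    diff = [abs(x - y) for x, y in zip(h_a, h_e)]
    prod = [x * y for x, y in zip(h_a, h_e)]
    return diff + prod


def is_borrowing(h_a: Sequence[float], h_e: Sequence[float],
                 classifier_prob: Callable[[List[float]], float],
                 t1: float = 0.6, t2: float = 0.5) -> int:
    """Return 1 if the pair passes both the cosine and the classifier threshold."""
    if cosine(h_a, h_e) <= t1:
        return 0  # cheap test first: the classifier is not called for distant pairs
    return 1 if classifier_prob(pair_features(h_a, h_e)) > t2 else 0


def confident(features: List[float]) -> float:
    """Hypothetical classifier stub that always returns a high probability."""
    return 0.9


print(is_borrowing([1.0, 0.0], [1.0, 0.0], confident))
print(is_borrowing([1.0, 0.0], [0.0, 1.0], confident))
```

Evaluating the cosine threshold before the classifier keeps the more expensive model off the pairs that are clearly dissimilar.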


5. SYNTHETIC COLLECTION 

Documents from the English and Arabic versions of Wikipedia were used to generate translated
borrowings. 100 thousand articles from the English version of Wikipedia were used as the source
collection $D_e$ [19]. A random subset of documents from the Arabic version of Wikipedia was used as the
collection $D_a$ of documents to be checked. To generate borrowings for each document $d_a^i \in D_a$, the
following algorithm was used:
i) Select candidate documents $\{d_e^j\}$ from the collection $D_e$. To reduce the vocabulary mismatch
between the candidate documents and the document being checked, candidates were drawn from the 500
documents most relevant to the document being checked $d_a^i$ according to a relevance measure. The number
of candidate documents was chosen randomly from 1 to 10.
ii) Select sentences from the candidate documents $\{d_e^j\}$ at random and translate them into Arabic.
iii) Replace random sentences of the document being checked $d_a^i$ with the translated sentences from
the candidate documents. The share of replaced sentences was chosen randomly from 20% to 80%.
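The generation procedure above can be sketched in Python. This is an illustration under assumptions: the source documents arrive already ranked by relevance (the relevance measure itself is not modelled), `translate` is a hypothetical callable standing in for the machine translation system, and the function name and its parameters are invented for the example.

```python
import random


def make_synthetic(doc_sentences, ranked_sources, translate, rng,
                   top_n=500, max_candidates=10, min_share=0.2, max_share=0.8):
    """Insert translated sentences into a checked document.

    doc_sentences:  sentences of the document being checked.
    ranked_sources: non-empty list of (doc_id, sentences), most relevant first.
    Returns the modified document and the alignment (doc_id, source sentence, position).
    """
    pool = ranked_sources[:top_n]
    n_candidates = rng.randint(1, min(max_candidates, len(pool)))
    candidates = rng.sample(pool, n_candidates)
    donors = [(doc_id, s) for doc_id, sentences in candidates for s in sentences]

    share = rng.uniform(min_share, max_share)
    n_replace = min(round(share * len(doc_sentences)), len(donors))
    positions = rng.sample(range(len(doc_sentences)), n_replace)
    chosen = rng.sample(donors, n_replace)

    result = list(doc_sentences)
    alignment = []
    for position, (doc_id, sentence) in zip(positions, chosen):
        result[position] = translate(sentence)
        alignment.append((doc_id, sentence, position))
    return result, alignment


def mark(sentence):
    """Hypothetical translation stub: only marks the sentence as translated."""
    return "T(" + sentence + ")"


checked = ["a%d" % i for i in range(10)]
sources = [("e1", ["s1", "s2", "s3", "s4"]), ("e2", ["s5", "s6", "s7", "s8", "s9"])]
new_doc, alignment = make_synthetic(checked, sources, mark, random.Random(0))
print(len(alignment), new_doc)
```

The returned alignment plays the role of the known list of fragment pairs against which precision and recall are later computed.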


5.1. Optimization of the parameters of the considered models 

The fastText library [20] was used as the model for the vector representation of words; its
parameters were optimized on the English version of Wikipedia. The dimension of the vector space for the
vector representations of both words and fragments was set to 100. To optimize the model of the vector
representation of text fragments, the AdaDelta algorithm was used with the parameters $\epsilon$ = 107°,
$\rho$ = 0.94 and $l_2$ regularization = 10~°. For the final loss function (2), the following
hyperparameter values were set: $\delta = 0.3$; $\alpha = 0.1$. The classifier thresholds in (6) were
selected by cross-validation: $t_1 = 0.6$; $t_2 = 0.5$. Agglomerative clustering of word vectors was used
to build the clusters, with the cosine distance function (1) between the corresponding vector
representations as the measure of word proximity. The final model contained 29 thousand clusters for 776
thousand words. A gated recurrent unit (GRU) recurrent model was used for both the encoder $h$ and the
decoder [21], [22]. The machine translation model was trained on 18.4 million parallel sentences from the
OPUS corpora [23], [24]. 10 million sentences from the English version of Wikipedia were used as the
sample for minimizing the reconstruction error $E_{rec}$ (3). The second term of the loss function (4)
uses information about similar sentences $X_{me} = \{(x_i, x_j)\}$; pairs of parallel sentences from the
OpenSubtitles corpus were used as the sample of such sentences [25]-[27].


5.2. Details of the computational experiment 

Three experiments were conducted on synthetic data.
i) Search for candidates. In this experiment, the quality of the obtained word cluster model was
analyzed. The baseline for comparison was the shingle-based algorithm without replacing words with
cluster labels.
ii) Comparison of text fragments. In this experiment, we considered the case in which candidate
selection is fully correct: Recall@10 = 1.0. The shingle-based algorithm again served as the baseline:
the document being checked $d_a^i$ was translated into English, after which the resulting text was
lemmatized and divided into a set of overlapping 4-grams. To account for possible word permutations in
translation, the words inside each 4-gram were sorted. The result of comparing two documents was the set
of matching sorted 4-grams.
iii) Evaluation of the entire algorithm (search for candidates followed by comparison of text
fragments).
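The baseline comparison from the second experiment can be sketched in a few lines of Python. The sketch assumes that translation and lemmatization have already been applied and that each text is given as a list of lemmas; the function names are hypothetical.

```python
def sorted_shingles(lemmas, n=4):
    """Set of overlapping n-grams with the words inside each n-gram sorted."""
    return {tuple(sorted(lemmas[i:i + n])) for i in range(len(lemmas) - n + 1)}


def matching_shingles(checked_lemmas, source_lemmas, n=4):
    """Sorted n-grams shared by the translated checked text and a source text."""
    return sorted_shingles(checked_lemmas, n) & sorted_shingles(source_lemmas, n)


checked = "the cat quickly chase small mouse".split()
source = "cat the quickly chase big mouse".split()
print(matching_shingles(checked, source))
```

Sorting makes the 4-gram insensitive to word order, so the swapped "the cat" / "cat the" still matches, while a single substituted word ("small" versus "big") breaks every 4-gram that contains it. This is consistent with the baseline detecting mostly near-duplicate text.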

This experiment allowed us to evaluate the quality of the presented algorithm as a whole. The
results of the candidate search experiment are presented in Table 1. The presented algorithm based on
word clusters achieves higher Recall@10 than the baseline shingle-based algorithm. The results of the
text fragment comparison experiments are presented in Table 2. The presented algorithm shows precision
comparable to that of the baseline and recall that significantly exceeds it. The high precision of the
baseline is explained by the fact that it captures only near-duplicate text. In the third experiment,
which evaluated the presented algorithm as a whole, the following values were obtained: Precision = 0.82;
Recall = 0.78; and F1 = 0.79.


Table 1. Results of the candidate search experiment
Algorithm    Recall@10
Basic        0.88
Presented    0.90


Table 2. Results of experiments on searching for similar text fragments
Algorithm    Precision    Recall    F1
Basic        0.93         0.14      0.25
Presented    0.87         0.79      0.84
6. RESULTS OF EXPERIMENTS ON A REAL COLLECTION OF SCIENTIFIC DOCUMENTS

To test the presented algorithm, an experiment on detecting translated borrowings was conducted
on a collection of documents from the electronic library library.iugaza.edu.ps. This resource also
provides metadata for each document: the title, the authors, the language, and the subject category
according to the state classification of scientific and technical information. 1.5 million documents in
Arabic were prepared as the documents to be checked, $D_a$. The source collection $D_e$ consisted of
documents from the English version of Wikipedia and English-language articles from arXiv.org.

The total number of documents obtained was 2.4 million. Because of the large number of documents
being checked, only documents with a significant number of detected borrowings were considered for
further analysis. There were 9 thousand such documents, of which 5.3 thousand, selected at random, were
analyzed. The main purpose of the experiment was to detect translated borrowings in which text was
borrowed from an English-language document into an Arabic-language document. At the same time, the
analysis of the results revealed several other kinds of positive responses of the presented algorithm,
which were divided into the following types: i) translated borrowing: the document contains text
translated from English and presented as original; ii) other borrowings: borrowings from Arabic-language
resources, or borrowings whose direction cannot be determined from the document dates; iii) bilingual
articles: works by the same author in two languages; iv) self-citation: the author cites their own
English-language work; v) citing laws: use of the wording of regulations; and vi) erroneous responses:
false-positive responses of the presented algorithm.


7. CONCLUSION 

The article discusses a technique for detecting translated borrowings. The technique is decomposed
into stages, which allows large text collections to be searched efficiently for borrowings. The proposed
approach is analyzed in detail, as is the composite error function used to optimize the deep learning
model. To assess the quality of the presented algorithm, experiments were conducted on synthetic data
for the Arabic-English language pair. The quality of the algorithm is also demonstrated on a collection of
Arabic-language documents. In the future, we plan to extend the proposed algorithm by using a sentence
vector representation model for the candidate search task and to improve the quality of the mapping from
a phrase to its vector.